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We discuss two new approaclies to extract relevant biological information on tlie 
Transcription Factors (and in particular to identify their binding sequences) from 
the statistical distribution of oligonucleotides in the upstream region of the genes. 
Both the methods are based on the notion of a "regulatory network" responsible 
for the various expression patterns of the genes. In particular we concentrate 
on families of coregulated genes and look for the simultaneous presence in the 
upstream regions of these genes of the same set of transcription factor binding sites. 
We discuss two instances which well exemplify the features of the two methods: 
the coregulation of glycolysis in Drosophila melanogaster and the diauxic shift in 
Saccharomyces cerevisiae. 



1 Introduction 

As more and more complete genomic sequences are being decoded it is becom- 
ing of crucial importance to understand how the gene expression is regulated. 
A central role in our present understanding of gene expression is played by the 
notion of "regulatory network" . It is by now clear that a particular expression 
pattern in the cell is the result of an intricate network of interactions among 
genes and proteins which cooperate to enhance (or depress) the expression 
rate of the various genes. It is thus important to address the problem of gene 
expression at tlieJeicfil of the whole regulatory network and not at the level of 
the single genfjHu'QB. 

In particular, most of the available information about such interactions 
concerns the transcriptional regulation of protein coding genes. Even if this 
is not the only regulatory mechanism of gene expression in eukaryotes it is 
certainly the most widespread one. 



1 



In these last years, thanks to the impressive progress in the DNA array 
technology several results on these regulatory networks have been obtained. 
Various transcription factors (TF's in the following) have been identified and 
their binding motifs in the DNA chain (see below for a discussion) have been 
characterized. However it is clear that we are only at the very beginning of 
such a program and that much more work still has to be done in order to 
reach a satisfactory understanding of the regulatory network in eukaryotes 
(the situation is somehow better for the prokaryotes whose regulatory network 
is much simpler). 

In this contribution we want to discuss a new method which allows to re- 
construct these interactions by comparing existing biological information with 
the statistical properties of the sequence data. This is a line of research which 
has been pursued in the last few years, with remarkable results, by several 
grouf^iDJthjijyorld. For a (unfortunately largely incomplete) list of references 
see B'LlU'Lrlj'Q'LlQ. In particular the biological input that we shall use is the fact 
that some genes, being involved in the same biological process, are likely to be 
"coregulated" i.e. they should show the same expression pattern. The simplest 
way for this to happen is that they are all regulated by the same set of TF's. 
If this is the case we should find in the upstream^ region of these genes the 
same TF binding sequences. This is a highly non trivial occurrence from a sta- 
tistical point of view and could in principle be recognized by simple statistical 
analysis. 

As a matter of fact the situation is much more complex than what appears 
from this idealized picture. TF's not necessarily bind only to the upstream 
region. They often recognize more than one sequence (even if there is usually 
a "core" sequence which is highly conserved). Coregulation could be achieved 
by a complex interaction of several TF's instead than following the simple 
pattern suggested above. Notwithstanding this, we think that it is worthwhile 
to explore this simplified picture of coregulation, for at least three reasons. 

• Even if in this way we only find a subset of the TF's involved in the coreg- 
ulation, this would be all the same an important piece of information: It 
would add a new link in the regulatory network that we are studying. 

• Analyses based on this picture, being very simple, can be easily per- 
formed on any gene set, from the few genes involved in the Glycolysis 
(the first example that we shall discuss below) up to the whole genome 
(this will be the case of the second example that we shall discuss). This 

"With this term we denote the portion of the DNA chain which is immediately before the 
starting point of the open reading frame (ORF). We shall characterize this region more 
precisely in sect.H below. 
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feature is going to be more and more important as more and more DNA 
array experiment appear in the literature. As the quantity of available 
data increases, so does the need of analytical tools to analyze it. 

• Such analyses could be easily improved to include some of the features 
outlined above, keeping into account, say, the sequence variability or the 
synergic interaction of different TF's. 

To this end we have developed two different (and complementary) ap- 
proaches. The first one (which we shall discuss in detail in sect.^ below) fol- 
lows a more traditional line of reasoning: we start from a set of genes which 
arc known to be coregulated (this is our "biological input") and then try to 
recognize the possible binding sites for the TF's. We call this approach the 
"direct search" for coregulating TF's. 

The second approacki which we shall briefly sketch in sect.^ below and is 
discussed in full detail inllj) is completely different and is particularly suitable 
for the study of genome-wide DNA array experiments. In this case the bio- 
logical input is taken into account only at the end of the analysis. We start 
by organizing all the genes in sets on the basis of the overrepresented common 
sequences and then filter them with expression patterns of some DNA array 
experiment. We call this second approach the "inverse search" for coregulating 
TF's. 

It is clear that all the candidate gene interactions which we identify with 
our two methods have to be tested experimentally. However our results may 
help selecting among the huge number of possible candidates and could be 
used as a preliminary test to guide the experiments. 

This contribution is organized as follows. In sect.|^ we shall briefly intro- 
duce the reader to the main features of the regulatory network (this introduc- 
tion will necessarily be vepi short, the interested reader can find a thorough 
discussion for instance in til). We shall then devote sect.^ and ^ to explain 
our "direct" and "inverse" search methods respectively. Then we shall discuss 
two instances which well exemplify the two strategies. First in sect|| we shall 
study the coregulation of glycolysis in Drosophila melanogaster. Second, in 
sectj^ we shall discuss the diauxic shift in Saccharomyces cerevisiae. The last 
section will be devoted to some concluding remarks. 

2 Transcription factors. 

As mentioned in the introduction, a major role in the regulatory network is 
played by the Transcription Factors, which may have in general a twofold action 
on gene transcription. They can activate it by recruiting the transcription 
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machinery to the transcription starting site by binding enhancer sequences 

in the upstream noncoding region, or by modifying chromatine stnicture, but 
they can also repress it by negatively interfering with the transcriptional control 
mechanisms. 

The main point is that in both cases TFs act by binding to specific, often 
sliort DNA sequences in the upstream noncoding region. It is exactly this 
feature which allows TF's to perform a specific regulatory functions. These 
binding sequences can be considered somehow as the fingerprints of the various 
TF's. The main goal of our statistical analysis will be the identification and 
characterization of such binding sites. 

2. 1 Classification 

Even if TF's show a wide variability it is possible to try a (very rough) clas- 
sification. Let us see it in some more detail, since it will help understanding 
the examples which we shall discuss in the following sections. There are four 
main classes of binding sites in eukaryotes. 

• Promoters 

These arc localized in the region immediately upstream of the coding 
region (often within 200 bp from the transcription starting point). They 
can be of two types: 

- short sequences like the well known CCAAT-box, TATA-box, GC- 
box which are not tissue specific and are recognized by ubiquitous 

TFs 

— tissue specific sequences which are only recognized by tissue specific 
TFs 

• Response Elements 

These appear only in those genes whose expression is controlled by an 
external factor (like hormones or growing factors). These are usually 
within Ikb from the transcription starting point. Binding of a response 
element with the appropriate factor may induce a relevant enhancement 
in the expression of the corresponding gene 

• Enhancers 

these are regulatory elements which, differently from the promoters, can 
act in both orientations and (to a large extent) at any distance from the 
transcription starting point (there are examples of enhancers located even 
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50-60kb upstream). They enhance the expression of the corresponding 
gene. 



• Silencers 

Same as the enhancers, but their effect is to repress the expression of the 
gene. 

2.2 Combinatorial regulation. 

The main feature of TF's activity is its "combinatorial" nature. This means 
that: 

• a single gene is usually regulated by many independent TF's which bind 
to sites which may be very far from each other in the upstream region. 

• it often happens that several TF's must be simultaneously present in 
order to perform their regulatory function. This phenomenon is usually 
referred to as the "Recruitment model for gene activation" (for a review 
seen) and represents the common pattern of action of the TF's. It is 
so important that it has been recently adopted as guiding principle for 
various computer based approaches to detect regulatory sites (see for 
instancea). 

• the regulatory activity of a particular TF is enhanced if it can bind to 
several (instead of only one) binding sites in the upstream region. This 
"overrepresentation" of a given binding sequence is also used in some 
algorithms which aim to identify TF's. It will also play a major role in 
our approach. 

3 The "direct" search method. 

In this case the starting point is the selection of a set of genes which are known 
to be involved in the same biological process (see example of sect. ||). 
Let us start by fixing a few notations: 

• Let us denote with M the number of genes in the coregulated set and 
with gi, i — 1 ■ ■ ■ M the genes belonging to the set 

• Let us denote with L the number of base pairs (bp) of the upstream non 
coding region on which we shall perform our analysis. It is important to 
define precisely what we mean by "upstream region" With this term we 
denote the non coding portion of the DNA chain which is immediately 
before the transcription start site. This means that we do not consider as 
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part of this region the UTR5 part of the ORF of the gene in which we are 
interested. If we choose L large enough it may happen that other ORFs 
are present in the upstream region. In this case we consider as upstream 
region only the non coding part of the DNA chain up to the nearest ORF 
(even if it appears in the opposite strand). Thus L should be thought of 
as an upper cutoff. In most cases the length of the upstream region is 
much smaller and is gene dependent. We shall denote it in the following 
as L{g). 

• In this upstream region we shall be interested in studying short sequences 
of nucleotides which we shall call words. Let n be the length of such a 
word. For each value of n we have iV = 4" possible words Wi, i — 1 ■ ■ ■ N . 
The optimal choice of n (i.e. the one which optimize the statistical 
significance of our analysis) is a function of L and M . We shall see 
some typical values in the example of sect|| In the following we shall 
have to deal with words of varying size. When needed, in order to avoid 
confusion, we shall call fc-word a word made of k nucleotides. 

Let us call U the collection of upstream regions of the M genes gi, . . . Qm- Our 
goal is to see if the number of occurrences of a given word Wi in each of the 
upstream regions belonging to U shows a "statistically significant" deviation 
(to be better defined below) from what expected on the basis of pure chance. 
To this end we perform two types of analyses. 

First level of analysis. 

This first type of analysis is organized in three steps 

• Construction of the "Reference samples" . The first step is the con- 
struction of a set of p "reference samples" which we call Ri,i = 1, • • - p. 
The Ri are nonoverlapping sequences of nucleotides each, extracted 
from a noncoding portion of the DNA sequence in the same region of 
the genome to which the genes that we study belong but "far" from any 
ORF. From these reference samples we then extract for each word the 
"background occurrence probability" that we shall then use as input of 
the second step of our analysis. The rationale behind this approach is the 
idea that the coding and regulating parts of the genome are immersed 
in a large background sea of "silent" DNA and that we may recognize 
that a portion of DNA has a biological function by looking at statisti- 
cal deviations in the word occurrences with respect to the background. 
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However it is clear that this is a rather crude description of the genome, 
in particular there are some obvious objections to this approach: 

— There is no clear notion of what "far" means. As we mentioned 
in the introduction one can sometimes find TF's which keep their 
regulatory function even if they bind to sites which are as far as 
~ 50kb from the ORF 

— It is possible that in the reference samples the nucleotide frequencies 
reflect some unknown biological function thus inducing a bias in the 
results 

— It is not clear how should one deal with the long repeated sequences 
which very often appear in the genome of eukaryotes 

We shall discuss below how to overcome these objections. 

• Background probabilities. For each word w we study the number of 
occurrences n(w, i) in the i*^ sample. They will follow a Poisson distribu- 
tion from which wc extract the background occurrence probability of the 
word. This method works only if p and are large enough with respect 
to the number of possible words N (we shall see in the example below 
some typical values for p and Ljj). However wc have checked that our 
results are robust with respect to different choices of these background 
probabilities. 

• Significant words. From these probabilities we can immediately con- 
struct for each n-word the expected number of occurrences in each of 
the upstream sequences of U and from them the probabilities p{n, s) of 
finding at least one n-word simultaneously present in the upstream re- 
gion of s (out of the M) genes. By suitably tuning L, s and n we may 
reach very low probabilities. If notwithstanding such a low probability 
we indeed find a n-word which appears in the upstream region of .s genes 
then we consider this fact as a strong indication of its role as binding 
sequence for a TF's. We may use the probability p{n, s) as an estimate 
of the significance of such a candidate binding sequence. 

As we have seen the critical point of this analysis is in the choice of the 
reference sample. We try to avoid the bias induced by this choice by crossing 
the above procedure with a second level of analysis 

Second level of analysis. 
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The main change with respect to the previous analysis is that in this case we 
extract the reference probabihties for the n-words from an artificial reference 
sample constructed with a Markov chain algorithm based on the frequencies 
of A:- words with k « n (usually k — 1, 2 or 3) extracted from the upstream 
regions themselves. Then the second and third step of the previous analysis 
follow unchanged. The rationale behind this second approach is that we want 
to see if in the upstream region there are some n-words (with n = 7 or 8, 
say) that occur much more often than what one would expect based on the 
frequency of the /c-words in the same region. 

These two levels of analysis are both likely to give results that are biased 
according to the different choices of reference probabilities that define them. 
However, since these biases are likely to be very different from each other, it 
is reasonable to expect that by comparing the results of the two methods one 
can minimize the number of false positives found. 



4 The "inverse" search method. 

A major drawback of the analysis discussed in the previous section is that it 
requires a precise knowledge of the function of the genes examined. As a matter 
of fact a large part of the genes of eukaryotes have no precisely know biological 
function and could not be studied with our direct method. Moreover in these 
last years the richest source of biological information on gene expression comes 
form microarray experiments, thus it would be important to have a tool to 
study gene coregulation starting from the output of such experiments. These 
two observations suggested us the inverse search method that we shall briefly 
discuss in this section. We shall outline here only the main ideas of the method, 
a detailed account can be found int3. 

The method we propose has two main steps: first the ORFs of an eukaryote 
genome are grouped in (overlapping) sets based on words that are overrepre- 
sented in their upstream region, with respect to their frequencies in a reference 
sample which is made of all the upstream regions of the whole genome. Each 
set is labelled by a word. Then for each of these sets the average expression 
in one ore more microarray experiments are compared to the genome- wide av- 
erage: if a statistically significant difference is found, the word that labels the 
set is a candidate regulatory site for the genes in the set, either enhancing or 
inhibiting their expression. 

An important feature is that the grouping of the genes into sets depends 
only on the upstream sequences and not on the microarray experiment consid- 
ered: It needs to be done only once for each organism, and can then be used 
to analyse an arbitrary number of microarray experiments. 
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Table 1: Genes involved in the glycolysis. 



Gene 


Description 


Locus 


Chromosome 


Aid 


Aldolase 


AE003755 


3R 


Eno 


Enolase 


AE003585 


2L 


Gapdhl 


Glyceraldehyde 3-ph. dehydrogenase 1 


AE003839 


2R 


Gapdh2 


Glyceraldehyde 3-ph. dehydrogenase 2 


AE003500 


X 


Hex 


Hexokinase 


AE003756 


3R 


ImpL3 


L-lactate dehydrogenase 


AE003563 


3L 


Pfk 


6-phosphofructokinase 


AE003755 


2R 



We refer toE3 for a detailed description of how the sets are constructed, we 
only stress here that this construction only requires three external parameters 
which must be fixed by the user: the length L of the upstream region (see 
sectjl for a discussion of this parameter), the length n of the words that we 
use to group the sets and a cutoff probability P which quantifies the notion of 
"overrepresentation" mentioned above. 

5 Example: glycolysis in Drosophila melanogaster. 

As an example of the analysis of sect.^, we studied the 7 genes of Drosophila 
melanogaster involved in glycolysis. These genes are listed in Tab.l. We per- 
formed our analysis with two choices of the parameters: 

1] Promoter region. In this first test we decided to concentrate in the 
promoter region. Thus we chose L < 100. With this choice, and since 
M — 7, we are bound to study n-words with n = 3, 4, 5 in order to have a 
reasonable statistical significance. In particular we concentrate on n = 3 
In the first level of analysis we chose Ln = 100 and p = 1000 (p is the 
number of reference samples). In the second level of analysis we chose 
k — I (k being the number of nucleotides of the fc- words used to construct 
the Markov chain). We found (among few other motifs which we do not 
discuss here for brevity) that a statistically relevant signal is reached by 
the sequence GAG. This result has a clear biological interpretation since 
it is the binding site of an ubiquitous TF known as GAGA factor which 
belongs to the class of the so called "zinc finger" TF's^. We consider this 
finding as a good validation test of the whole procedure. 

''The commonly assumed binding site for the GAGA factor is the sequence GAGAG, howerar 
it has been recently realized that the minimal binding sequence is actually the 3- word GAGI13. 
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Table 2: Probability p(n, 7) of finding a n-word in the upstream region of all the 7 genes 
involved in glycolysis. In the first column the value of n, in the second the result obtained 
using the background probabilities. In the last two columns the result obtained with the 
Markov chains with fc = 1 and k = 2 respectively. 



n 


p{n,7) 


p{n, 7), k = 1 


p{n,7), k = 2 


6 


0.346 


0.76 


0.78 


7 


0.007 


0.013 


0.022 


8 


0.00025 


0.000034 


0.00011 



2] large scale analysis 

In this second test we chose L — 5000. This aUowed us to address n- 
words with n = 6, 7, 8. For the reference samples we used Lji = 5000 and 
p = 21 As a result of our analysis we obtained the probabilities p(n, s) 
of finding at least one n-word in the upstream region of s out of the 7 
genes that we are studying. As an example we list in Tab. 2 the values of 
p{n, s) for s = 7 and n = 6, 7, 8. For the Markov chain analysis we used 
/c = 1,2. 

In this case we found a 7- word which appeared in the upstream region of 
all the seven genes: A fact that, looking at the probabilities listed in tab. 3 
certainly deserves more attention. The word is TTTAAAT. A survey 
in the literature shows that this is indeed one of the binding sequences of 
a TF known as "even-skipped" which is known to regulate segmentation 
(and also the development of certain neurons) in Drosophila. This TF has 
been widely studied due to its crucial role in the early stages of embryo 
development, but it was not directly related up to now to the regulation 
of glycolysis. 



6 Example: diauxic shift in S. cerevisiae 

As an example of the analysis of sect.^, we studied the so called diauxic shift, 
{i.e. the metabolic shift from fermentation to respiration), in S. cerevisiae 
the pattern of gene expression during the shift was measured with DNA mi- 
croarrays techniques in Ref. 113. In the experiment gene expression levels were 
measured for virtually all the genes of at seven time-points while glucose in 
the medium was progressively depleted. As a result of our analysis we found 
29 significant words, that can be grouped into 6 motifs {i.e. groups of similar 
words). Five of them correspond to known regulatory motifs (for a database of 
known and putative TF's binding sites in S. cerevisiae see ref.cl). In particular 
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three of them: STRE, MIGl and UME6 (for the meaning of these abbrevi- 
ations see again □) were previously known to be involved in glucose-induced 
regulation process, while for the two other known motifs: PAC and RRPE 
this was a new result. We consider the fact of having found known regulatory 
motifs a strong validation of our method. 

Finally we also found a new binding sequence: ATAAGGG, which we 
could not associate to any known regulatory motif. 

7 Conclusions 

We have proposed two new methods to extract biological information on the 
Transcription Factors (and more generally on the mutual interactions among 
genes) from the statistical distribution of oligonucleotides in the upstream re- 
gion of the genes. Both are based on the notion of a "regulatory network" 
responsible for the various expression patterns of the genes, and aim to find 
common binding sites for TFs in families of coregulated genes. 

• The methods can be applied both to selected sets of genes of known 
biological functions (direct search method) or to the genome wide mi- 
croarrays experiments (inverse search method). 

• They require a complete knowledge of the upstream oligonucleotide se- 
quences and thus they can be applied for the moment only to those 
organisms for which the complete genoma has been sequenced. 

• In the direct method, once the set of coregulated genes has been chosen, 
no further external input is needed. The significance criterion of our 
candidates binding sites only depends on the statistical distribution of 
oligonucleotides in the upstream region (or in nearby regions used as test 
samples) 

• Both can be easily implemented and could be used as standard prelimi- 
nary tests, to guide a more refined analysis. 

Even if they already give interesting results, both our methods are far from 
being optimized. In particular there are three natural directions of improve- 
ment. 

a] Taking into account the variability of the binding sequences, 

b] Recognizing dyad like binding sequences (see for instance 0) which are 

rather common in eukaryotes. 
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c] Recognizing synergic interactions between TF's. 

Work is in progress along these lines. 

Needless to say the candidate binding sequences that we find with our 
method will have to be tested experimentally. However our method could help 
to greatly reduce the number of possible candidates and could be used as a 
guiding line for experiments. 
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